ABSTRACT

REEX is a CONVERT program, realized in the CTSS LISP of
Project MAC, for carrying out the McNaughton-Yamada analysis
algorithm, whereby a regular expression is found describing the
words accepted by a finite state machine whose transition table is
given. Unmodified, the algorithm will produce 4^n terms representing
an n-state machine. This number could be reduced by eliminating
duplicate calculations and by rejecting on a high level expressions
corresponding to no possible path in the state diagram. The
remaining expressions present a serious simplification problem, since
empty expressions and null words are generated liberally by the
algorithm. REEX treats only the third of these problems, and at
that makes simplifications mainly oriented toward removing null
words, empty expressions, and expressions of the form X ∪ X*, A ∪ B*A,
and others closely similar. REEX is primarily useful for understanding
the algorithm, but is hardly usable for machines with six or more states.

Since regular expressions form such a convenient characterization
of the words accepted by a finite state machine, it is desirable to have
a means of deducing the descriptive regular expression from the
transition table of the machine. The first such algorithm was described
by McNaughton and Yamada, and for some time was apparently the only such
algorithm known. While it is conceptually quite simple, its application
in practice can lead to grossly cumbersome expressions. Nevertheless
its mechanization is instructive, and is the object of the CONVERT program
REEX.

We will use the following example to illustrate our discussion.




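[Figure: state diagram of the example machine]

The states of the machine to be analyzed are supposed to be numbered
1 through n. We then define regular expressions \alpha_{ij}^{k} recursively as
follows, where \lambda denotes the null word.

    \alpha_{ii}^{0} = \{\sigma \mid M(i,\sigma) = i\} \cup \lambda

    \alpha_{ij}^{0} = \{\sigma \mid M(i,\sigma) = j\}

    \alpha_{ij}^{k} = \alpha_{ij}^{k-1} \cup \alpha_{ik}^{k-1} (\alpha_{kk}^{k-1})^{*} \alpha_{kj}^{k-1}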

By examining these definitions it can be seen that \alpha_{ij}^{k} is a
regular expression representing all the words corresponding to transitions
from state i to state j without passing through states numbered greater
than k. Transitions from state i to state j without any such restriction
are then given by the expressions \alpha_{ij}^{n}, and the regular expression
representing the machine is the union of such expressions where i is the
initial state and j belongs to the accepting set.
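
The recursion can be tried out directly. The following Python sketch is ours and is not part of REEX: the names alpha, ends and accepts, the tuple representation of expressions, and the three-state table TT are all assumptions made for illustration. It builds the unsimplified expression and then checks it against individual words with a small matcher.

```python
EMPTY = ()      # the empty set of words
NULL = "$"      # the null word

def alpha(tt, i, j, k):
    """Unsimplified expression for the words leading from i to j
    without passing through states numbered greater than k."""
    if k == 0:
        letters = [a for (p, a, q) in tt if p == i and q == j]
        if i == j:
            letters.insert(0, NULL)
        return ("UN", *letters) if letters else EMPTY
    return ("UN", alpha(tt, i, j, k - 1),
            ("CN", alpha(tt, i, k, k - 1),
                   ("IT", alpha(tt, k, k, k - 1)),
                   alpha(tt, k, j, k - 1)))

def ends(e, w, start):
    """Positions at which a match of e beginning at `start` in w can end."""
    if e == EMPTY:
        return set()
    if e == NULL:
        return {start}
    if isinstance(e, str):                       # a single letter
        return {start + 1} if w[start:start + 1] == e else set()
    op, *args = e
    if op == "UN":
        return set().union(*(ends(a, w, start) for a in args))
    if op == "CN":
        current = {start}
        for a in args:
            current = set().union(*(ends(a, w, p) for p in current))
        return current
    # op == "IT": zero or more repetitions, computed as a closure
    reached, frontier = {start}, {start}
    while frontier:
        frontier = set().union(*(ends(args[0], w, p) for p in frontier)) - reached
        reached |= frontier
    return reached

def accepts(e, w):
    return len(w) in ends(e, w, 0)

# Hypothetical table of triplets (state, letter, next state); state 3 is a trap.
TT = [(1, "0", 1), (1, "1", 2), (2, "0", 2), (2, "1", 3),
      (3, "0", 3), (3, "1", 3)]
expression = alpha(TT, 1, 2, 3)
print(accepts(expression, "00100"), accepts(expression, "0110"))
```

For this assumed table the words leading from state 1 to state 2 are exactly those containing a single 1, so the expression printed in full is an unsimplified form of 0*10*.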

The transcription of this recursive definition into CONVERT presents
no complications. In fact, if we let the index k be the state set, and the
indices i and j be actual states, we may avoid the necessity of
numbering the states. The pattern we are to recognize is then the list
(I J K), and it will be seen in the three forms

    (I I ())

    (I F ())

    (I F (X XXX))

The skeletons which will be substituted in the three cases will be
respectively a set of letters union the null letter, a set of letters,
and the CONVERT form of the recursive formula. We need a means of
extracting the letters causing a given transition, for which we introduce
the patterns (U*) or (T*).

    (U*) PAT ((*OR* (--- (I L I) U*) (---)))

    (T*) PAT ((*OR* (--- (I L F) T*) (---)))

where

    L BUV -ATO-

These definitions assume that the transition table TT is presented
in the form (--- (I L F) ---) where I is the state from which the letter
L causes a transition to F; i.e. M(I,L) = F. (U*) and (T*) are then
collection patterns using the bucket variable L to find all the letters
causing transitions respectively from I to I, or I to F.
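
For readers unfamiliar with bucket variables, the effect of these collection patterns can be imitated in a few lines of Python. This is only a sketch under our own assumptions: the function name letters, the sample table TT, and the decision to drop repeated letters are ours and are not taken from CONVERT.

```python
def letters(tt, i, f):
    """Collect, in table order, every letter L for which the
    triplet (i, L, f) occurs in the transition table tt."""
    bucket = []
    for (state, letter, target) in tt:
        if state == i and target == f and letter not in bucket:
            bucket.append(letter)
    return bucket

# Hypothetical two-state table of triplets (I L F).
TT = [("A", "0", "A"), ("A", "1", "B"), ("B", "0", "B"), ("B", "1", "B")]
print(letters(TT, "A", "A"))   # the role played by (U*) with I = A
print(letters(TT, "A", "B"))   # the role played by (T*) with I = A, F = B
```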

If we use the symbol $, available in the character set of CTSS LISP,
for the null word, we can now write the McNaughton-Yamada algorithm in
CONVERT form. The rules are

    ((I I ()) (-WHEN- TT (U*) (UNO $ (*UNON* L))))

    ((I F ()) (-WHEN- TT (T*) (UNO (*UNON* L))))

    ((I F (X XXX)) (UNO (-REPT- (I F (XXX)))
                        (CON (-REPT- (I X (XXX)))
                             (ITR (-REPT- (X X (XXX))))
                             (-REPT- (X F (XXX))))))

The symbols UNO, CON, and ITR stand respectively for union, concatenation,
and iteration, the "polish" form of the connectors which form regular
expressions. In actual practice, the state set is not given and must be
deduced from the transition table. This is done with a bucket variable
and collecting skeleton,

    S BUV -ATO-

    (S*) PAT ((*OR* (--- (S -- S) S*) (---)))

which is applied to the transition table. Hence our actual CONVERT program
contains the rule



    ((I F (S*)) (UNO (-ITER- J F (-REPT- (I J (*UNON* S)) *1 (---)))))

wherein (---) stands for the rule set displayed above. In this way we take
account of all the states in the accepting set, and obtain the state
set implicitly from the bucket variable S. It follows that (REEX I F T)
has as arguments the initial state, the set of accepting states, and the
transition table as a list of triplets of the form (I L F).

If we now set out to calculate some examples we begin to find
that the algorithm is not really very satisfactory, mathematically
correct though it may be. The basic problem is that the terminal condition
very often leads to an empty set. Consequently if it occurs as a term
in the concatenation, the concatenated expression will likewise be an
empty set. However, this is not obvious from the expression which the
algorithm produces, and some simplification must be made. We run the
risk that we may calculate all three terms of the concatenation before
discovering that one of them renders the result trivial.

A second problem is that a very great deal of duplicate calculation
can occur. In other words, the same subexpression may arise from a
variety of histories, and be calculated anew each time. This is the more
time consuming, the higher the level on which it occurs.

A third flaw lies in the fact that the method is prone to produce
redundant expressions; that is, such things as the union of X and $
concatenated with X* in the simple case, or 110* union 0*110*, to cite a
slightly more complicated example. If all such redundancies were of a
simple nature, they could be edited away, but unfortunately they become
more and more subtle as the number of states of the machine increases.
They arise from otherwise identical paths which do or do not include
certain loops or branches.

The order in which the states are listed not only may lead to
alternate regular expressions representing the machine, but sometimes
can lead to vastly different amounts of calculation even when simplifying
and precautionary techniques are included. McNaughton and Yamada recommend
giving high numbers to heavily trafficked states, to increase the number
of null subexpressions which can be recognized on sight. In this regard
it is certainly profitable to find which pairs of points have no path
whatsoever connecting them, for if there is no such path there will be
none passing through designated intermediate states either, and one can
write the empty expression at once without proceeding through the recursion.
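
One way such a table of connected pairs might be prepared is by a transitive closure over the transition table. The Python sketch below is an illustration of ours, not something REEX does; the function name reachable_pairs and the sample table TT are assumptions.

```python
def reachable_pairs(tt):
    """Pairs (p, q) such that some path of at least one letter leads from p to q."""
    states = {s for (p, _, q) in tt for s in (p, q)}
    reach = {(p, q) for (p, _, q) in tt}
    for k in states:                     # Warshall's transitive closure
        for p in states:
            if (p, k) in reach:
                for q in states:
                    if (k, q) in reach:
                        reach.add((p, q))
    return reach

# Hypothetical three-state table in which state 3 is a trap.
TT = [(1, "0", 1), (1, "1", 2), (2, "0", 2), (2, "1", 3),
      (3, "0", 3), (3, "1", 3)]
reach = reachable_pairs(TT)
print((3, 2) in reach, (1, 3) in reach)
```

With such a set at hand, any index (i j k) with i different from j and (i, j) absent from the set could be replaced by the empty expression without recursion. The case i = j has to be excluded because those expressions always contain the null word.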



To get an idea of how impressively expansive the algorithm is,
we have to see that at each step in the recursion we generate four new
regular expressions, bound together by various operators. Hence after
n steps, when the recursion terminates, we have 4^n expressions; a truly
exponential growth. Moreover this number is to be multiplied by the number
of accepting states.
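
The count can be checked mechanically, and compared with the number of distinct indices that actually occur. The following Python sketch is ours; the names naive_leaves and distinct_indices are assumptions, and only the bookkeeping of indices is modelled, not the expressions themselves.

```python
def naive_leaves(k):
    """Number of terminal expressions produced by fully expanding one index (i j k)."""
    return 1 if k == 0 else 4 * naive_leaves(k - 1)

def distinct_indices(n, i, j):
    """Number of distinct triples (p, q, m) met while expanding (i, j, n)."""
    seen = set()

    def visit(p, q, m):
        if (p, q, m) in seen:
            return                       # already expanded once
        seen.add((p, q, m))
        if m > 0:
            # the four indices on the right-hand side of the recursion
            for (a, b) in ((p, q), (p, m), (m, m), (m, q)):
                visit(a, b, m - 1)

    visit(i, j, n)
    return len(seen)

print(naive_leaves(3), distinct_indices(3, 1, 2))
```

Since there are only n states, no more than n*n distinct pairs can occur on any one level, which is why retaining results already computed limits the work so sharply.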

In the example we have cited, the regular expression corresponding
to initial state 1 and accepting state 2 is clearly 0*10*. If we list only
subscripts, the expression we need is (i j k) = (1 2 3). Expanding, we
find

[Figure: tree of indices obtained by expanding 123; repeated indices are lined out and the index 322 is circled]

The indices which have been lined out are those which are repeated,
so that calculating them an additional time is redundant. Moreover, the
circled index 322 is ∅, on the grounds that no arrow of any sort runs from
state 3 to state 2. Hence the concatenation of the last three terms will
also be ∅, and only the first expression need be pursued, an observation
which would immediately reduce the calculation to 1/4.

Even if that simplification were not made, half the terms on the
third level and over half those on the fourth level are redundant,
reducing the calculation to 1/4. Both simplifications are reasonably
typical of more complex expressions. For the moment let us use the
simplification afforded by 322 = ∅, to write

    123 = 122
        = 121 ∪ 121·(221)*·221

now,

    121 = 120 ∪ 110·(110)*·120
        = 1 ∪ (0 ∪ $)(0 ∪ $)*1

and

    221 = 220 ∪ 210·(110)*·120
        = (0 ∪ $) ∪ ∅·(0 ∪ $)*·1

Here we see on the lowest level quite simple expressions written
in a very cumbersome way. For example, 121 simplifies to 0*1, while
221 is (0 ∪ $). Thus

    123 = 122 = 0*1 ∪ 0*1(0 ∪ $)*(0 ∪ $)
              = 0*10*

Thus the higher levels continue to contribute clumsiness, even though
we finally obtain the obvious result.

Our present program makes no attempt to avoid duplicate calculations,
nor to exclude those which are destined to produce ∅ for lack of any
possible paths. One would think it a small sacrifice to reserve an array
of n elements to retain this information, since otherwise the calculation
will be far too time-consuming to treat machines with even half a dozen
elements. Even so, such measures seem to be destined to lower the rate
of growth only to 2^n rather than 4^n.

However, we have studied to a slight extent the third problem, of
simplifying the expressions produced by the algorithm. It was clear
from examining results that some very simple redundancies were accounting
for a substantial fraction of the complexity in the final result. The
technique is to make the regular expression operators UNO, CON, and ITR
into functions responsible for simplifying their arguments.



Let us review them one by one.

    CON REPT (
        ((--- () ---) ())
        ((X) X)
        ((XXX $ YYY) (-REPT- (XXX YYY)))
        ((XXX (CN YYY) ZZZ) (-REPT- (XXX YYY ZZZ)))
        ((XXX (IT X) (UN $ X) YYY) (-REPT- (XXX (IT X) YYY)))
        ((XXX (UN $ X) (IT X) YYY) (-REPT- (XXX (IT X) YYY)))
        (-- (CN -SAME-))
    )

The significance of these simplifications is, line by line:

    A concatenation involving the empty set is empty
    No sign of operation is written to concatenate one element
    The null word need not be written explicitly
    Concatenation is associative
    ($ ∪ X)·X* = X*·($ ∪ X) = X*
    Otherwise prefix notation is used with the symbol CN.
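
The same rules can be imitated outside CONVERT by a constructor that simplifies its arguments before building the expression. The Python sketch below is ours and only approximates the rule set: the names con and is_op, and the representation of expressions as tuples with () for the empty set and "$" for the null word, are assumptions.

```python
EMPTY, NULL = (), "$"

def is_op(e, op):
    """True when e is a compound expression headed by the operator op."""
    return isinstance(e, tuple) and e[:1] == (op,)

def con(*args):
    """Concatenation which applies the rules listed above, in that order."""
    args = list(args)
    while True:
        if EMPTY in args:                      # anything times the empty set
            return EMPTY
        if len(args) == 1:                     # one element needs no operator
            return args[0]
        if NULL in args:                       # the null word is not written
            args.remove(NULL)
            continue
        nested = next((n for n, a in enumerate(args) if is_op(a, "CN")), None)
        if nested is not None:                 # associativity: splice it in
            args[nested:nested + 1] = args[nested][1:]
            continue
        for n in range(len(args) - 1):         # ($ u X)X* = X*($ u X) = X*
            a, b = args[n], args[n + 1]
            if is_op(a, "IT") and b == ("UN", NULL, a[1]):
                del args[n + 1]
                break
            if is_op(b, "IT") and a == ("UN", NULL, b[1]):
                del args[n]
                break
        else:
            return ("CN", *args)               # nothing further applies

print(con("0", ("CN", "1", "0")))
print(con(("IT", "0"), ("UN", "$", "0"), "1"))
```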

There is already apparent in the simplification ($ ∪ X)·X* the fact
that there are a great many equivalent forms which it is a nuisance to have
to list separately; here we have used two rules for ($ ∪ X)·X* and
X*·($ ∪ X), however there should be two more for (X ∪ $)·X* and X*·(X ∪ $).
The unordered variables mode of CONVERT is helpful in such situations, and
is used in the simplification of the union. We have



    I      UNO    (X -I-)
    J      UNO    (X (IT X))
    K      UNO    ((CN WWW) AB*)
    L      UNO    (B*A (CN WWW))
    AB*    PAV    (CN (IT --) WWW)
    B*A    PAV    (CN WWW (IT --))
    -I-    PAV    (-OR- (CN X (IT --)) (CN (IT --) X))

With these constituents, we have

    UNO REPT ((-- (-REPT- (-WON- -SAME-) *2 (
        ((XXX () YYY) (-REPT- (XXX YYY)))
        ((XXX (UN YYY) ZZZ) (-REPT- (XXX YYY ZZZ)))
        ((XXX J YYY J ZZZ) (-REPT- (XXX YYY (IT X) ZZZ)))
        ((XXX I YYY I ZZZ) (-REPT- (XXX YYY -I- ZZZ)))
        ((XXX K YYY K ZZZ) (-REPT- (XXX YYY AB* ZZZ)))
        ((XXX L YYY L ZZZ) (-REPT- (XXX YYY B*A ZZZ)))
        ((X XXX $ YYY) (-REPT- ($ X XXX YYY)))
        ((X) X)
        (-- (UN -SAME-))
    ))))

Again we may make a line-by-line analysis. On entry to the
function UNO, we eliminate obviously repeated arguments. Then

    The null set is deleted from a union
    Union is associative
    If X and X* appear in the union, we retain only X*
    If both X and XY* appear then we retain only XY*
    Likewise if WWW X* and WWW appear
    Or X* WWW and WWW
    In a list of at least one element we place the null
    word first, an assumption made in the simplification
    of the concatenation.
    We write no operator for the union of one element
    Otherwise the symbol UN is written in prefix notation.
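
A corresponding sketch of the union in Python follows. It is ours and only approximates the rule set: the names uno, absorbs and is_op, and the tuple representation with () for the empty set and "$" for the null word, are assumptions, and the several absorption rules are gathered into a single test rather than handled by unordered variables.

```python
EMPTY, NULL = (), "$"

def is_op(e, op):
    """True when e is a compound expression headed by the operator op."""
    return isinstance(e, tuple) and e[:1] == (op,)

def absorbs(big, small):
    """True when `small` adds nothing to a union already containing `big`."""
    if big == ("IT", small):                         # X u X* = X*
        return True
    if not is_op(big, "CN") or len(big) < 3:
        return False
    body = big[1:]
    word = small[1:] if is_op(small, "CN") else (small,)
    return ((is_op(body[0], "IT") and body[1:] == word) or    # W u Y*W = Y*W
            (is_op(body[-1], "IT") and body[:-1] == word))    # W u WY* = WY*

def uno(*args):
    """Union which applies the simplifications listed above."""
    flat = []
    for a in args:                       # associativity: splice nested unions
        flat.extend(a[1:] if is_op(a, "UN") else [a])
    terms = []
    for a in flat:                       # drop the empty set and repetitions
        if a != EMPTY and a not in terms:
            terms.append(a)
    terms = [s for s in terms
             if not any(absorbs(b, s) for b in terms if b != s)]
    if NULL in terms:                    # the null word is placed first
        terms.remove(NULL)
        terms.insert(0, NULL)
    if not terms:                        # our choice: an empty union is empty
        return EMPTY
    return terms[0] if len(terms) == 1 else ("UN", *terms)

print(uno("1", ("CN", "1", ("IT", "0"))))
print(uno("0", "$", "1"))
```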

Finally, the simplifications of an iteration are the following.

    ITR REPT (((X) (-REPT- X *3 (
        ((UN XXX $ YYY) (-REPT- (UNO XXX YYY)))
        ($ $)
        (() $)
        (-- (IT -SAME-))
    ))))

There is very little simplification that we have seen fit to do
directly on an iteration. The initial transformation is used to
overcome the fact that CONVERT functions list their arguments, even when
there is only one. We note that $* = ()* = $; otherwise the expression
is left intact.
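
In the same spirit, the iteration can be sketched in Python. The function name itr and the tuple representation, with () for the empty set and "$" for the null word, are our assumptions. Where the CONVERT rule hands the remaining terms back to UNO, this sketch simply rebuilds the union.

```python
EMPTY, NULL = (), "$"

def itr(x):
    """Iteration with the few simplifications described above."""
    if isinstance(x, tuple) and x[:1] == ("UN",) and NULL in x[1:]:
        # a null word inside an iterated union contributes nothing
        rest = [a for a in x[1:] if a != NULL]
        if not rest:
            x = NULL
        elif len(rest) == 1:
            x = rest[0]
        else:
            x = ("UN", *rest)
    if x in (EMPTY, NULL):               # $* = ()* = $
        return NULL
    return ("IT", x)

print(itr(("UN", "$", "0")))
print(itr(EMPTY))
```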

To see how effective these rules are we could consider some
examples.

    reex (o (i ii) ((o 1 ii) (o 0 i) (i 1 ii) (i 0 iv)
                    (ii 0 i) (ii 1 iv) (iv 0 iv) (iv 1 iv)))

    (UN 0 (CN 1 0) (CN (UN 0 (CN 1 0)) (IT (CN 1 0))) 1 (CN (UN
    0 (CN 1 0)) (IT (CN 1 0)) 1))

    0 ∪ 10 ∪ [(0 ∪ 10)(10)*] ∪ 1 ∪ [(0 ∪ 10)(10)*1]

    reex (o (i ii) ((o 0 iv) (o 1 i) (i 0 ii) (i 1 iv)
                    (ii 0 iv) (ii 1 i) (iv 0 iv) (iv 1 iv)))

    (UN 1 (CN 1 0 (IT (CN 1 0)) 1) (CN 1 0 (IT (CN 1 0))))

    1 ∪ (10(10)*1) ∪ (10(10)*)


As may be seen from the examples, the rules given succeed in
eliminating almost all of the complexity due to redundant empty
expressions and null words. Nevertheless, they are reasonably ad hoc,
and do not eliminate more subtle types of redundancy. The subject might
be worth pursuing further to test one's understanding of the simplification
process, but a basic fault of the method is that it generates such
cumbersome and so numerous expressions initially. Fortunately there are
more amenable techniques available to form the regular expression which
corresponds to a machine or transition system, principally the method of
writing a series of regular expression equations for the states and
solving them simultaneously.